ABSTRACT

The designing of tests has been a source of concern 
for test developers over the past decade. Various kinds of test forms 
have been applied. Among these are the fixed-form test, the adaptive 
test, and the testlet. Each of these forms has its own design. In 
this paper, the construction of test forms is placed within the 
general framework of optimal design theory. A review of various 
objective functions and methods for the designing of different test 
forms is given. The advantages of using these methods are discussed, 
and an illustration of an optimal test design is provided. (Contains 
3 figures, 1 table, and 36 references.) (Author/SLD)

Key words: optimal test design, adaptive tests, testlets, sequential procedure,
efficiency, consistency.

A Review of Selection Methods for Optimal Test Design



Since the First World War, the construction of tests in education and
psychology has gone through a number of different stages, and tests have been
administered in various forms. At first tests were constructed by hand, but the
recognition that test construction could be improved by taking into account the
psychometric characteristics of the items has led to alternative, more
structured methods. Perhaps one of the most promising directions is the use of
so-called item banks. An item bank is a very large set of items. These items
are grouped into content areas, and it is assumed that their psychometric
characteristics have been estimated. When such an item bank is
available, a test is constructed by selecting items from the bank
according to certain specifications. Much research has been done on optimal
item selection methods, and many of these methods are based on mathematical
programming procedures. See Adema (1990), Boekkooi-Timminga (1989), and
Theunissen (1986) for a review of these methods. Although the mathematical
programming methods were mainly proposed for the construction of fixed-form
tests, other forms such as two-stage and parallel tests (Adema, 1990) can also be
handled by these methods. Recently two computer programs based on
mathematical programming algorithms have been developed, namely the
CONTEST program (Boekkooi-Timminga and Sun, 1991) and the OTD program
(Verschoor, 1991).

The fact that fixed-form tests do not have equal reliability or equal
validity over the whole range of abilities in the population has motivated Lord
(1971, 1980) and Weiss (1976, 1978), among others, to propose adaptive test
forms. The central idea was that giving each examinee in a sample an
individually designed test would lead to more efficient estimation of the
abilities of these examinees. The availability of fast computers and item response
theory (IRT) models has made the development of computerized versions of
adaptive testing (CAT) possible. See Wainer (1990) for a review of various
aspects of CAT.

With the development of item banks and computerized adaptive tests, the
special skills of the test developer were replaced by statistical characteristics.
This development was criticized by Wainer and Kiely (1987), who argued that
the test developer's skills are still needed in the construction process. Because
several practical problems with the existing CAT procedures had not been solved
satisfactorily, Wainer and Kiely (1987) and Wainer and Lewis (1990) proposed the
application of so-called testlets. Testlets are small bundles of items
through which examinees follow a fixed number of paths. A test may consist of a
number of different testlets, and an examinee does not have to take every testlet
in the test, nor every item within a testlet. The
advantages and disadvantages of fixed-form tests and adaptive tests are combined
in a testlet design.

The construction of different test forms described above can be regarded
as an optimal design problem. Optimal design methods have been applied in
various fields of research. Although most of the developments have been reported
and applied in bioassay research, optimal design methods can also be applied in
educational measurement. Berger (1991) and Berger and van der Linden (1992),
for example, have recently described the application of optimal design methods
to the design of optimal samples for item parameter estimation in IRT
models.

The objective of this paper is to give a review of the different optimal 
design methods and criteria for item selection for the construction of different test 
forms. This review places these methods within the general framework of optimal 
design theory. Silvey (1980) and Ford, Kitsos and Titterington (1989) give a 
review of optimal design research for nonlinear models. The present paper also
indicates that the optimal design methods all have the same characteristics and
can be applied to any IRT model. This review not only includes the already
known methods but also introduces some alternative selection methods which 
may prove useful in the future. 

First a description of a test design will be given. Then the two most
frequently applied information measures will be described, and finally the
different criteria for the selection of items in different situations will be
reviewed.

Test Design 

A test design is characterized by the pattern of the examinee-item
combinations, and is connected with a particular test form. For
example, a fixed-form test, in which all examinees take the same items,
may be designed in such a way that the items are ordered from very easy to
extremely hard. If examinees take the items starting with the easiest
item and stop as soon as they give a wrong answer, then the most able
examinees will have to answer more items than the examinees with a low ability
level. When examinees are also ordered according to their ability level, the
scores of such a test design will approximately have a Guttman scale pattern.
Figure 1 gives an example of such an approximate Guttman test design. The crosses
in Figure 1 indicate the examinee-item combinations. The 16 examinees take a
test consisting of 20 items. The examinee with the lowest ability level takes only
two out of 20 items, and the most able examinee takes 19 out of 20 items. It
should be emphasized that the empty cells in the score matrix of the approximate
Guttman test design are empty by design, i.e., the design determines whether
a response is available or not.



Insert Figure 1 about here 



Adaptive tests also have special designs. Most adaptive tests are
administered in such a way that each examinee takes a different set of items.
The full $N \times n$ matrix of responses of $N$ examinees on a total of $n$
different items will therefore contain many empty cells. In an adaptive test
design, the pattern of the cells in the $N \times n$ response matrix is determined
by the adaptation process.
The design pattern connected with an adaptive test form is not fixed,
and may be completely different for examinees having the same ability level. An
example of an adaptive test design is also given in Figure 1. Note that in this
particular design 10 out of 20 items are administered to each
of the 16 examinees.

The designs connected with testlets are more fixed than adaptive test
designs. A test containing several testlets will usually have a limited number of
paths for an examinee to run through. Depending on their responses to previous
items, examinees may take different items in the testlet and may even take only
some of the testlets in the test. The response pattern in a testlet design often
follows a kind of branching scheme. Two different types of branching in
a testlet design may be distinguished. When examinees do not have to take every
testlet in the test, this may be referred to as between-testlet
branching. When a testlet is structured in such a way that a fixed number of
branches through the items within that testlet is possible, this will be referred
to as within-testlet branching. The third diagram in Figure 1 displays a typical
within-testlet design connected with a hierarchical testlet (Wainer & Kiely, 1987).
In this example, the 16 examinees all take the first item. Then, depending on
their response to the first item, they take the second or the third item, and so on.
Testlet designs are not as flexible as adaptive test designs, but more flexible than
the designs for fixed-form tests.
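
A test design can be stored as an $N \times n$ matrix of zeros and ones, where a one marks an examinee-item combination that is administered. The following minimal sketch generates an approximate Guttman design of the kind described above. It is only an illustration: the function name `guttman_design`, the simulated Rasch responses, the ability and difficulty grids, and the random seed are all assumptions of this sketch and are not taken from Figure 1.

```python
import numpy as np

def guttman_design(theta, b, rng):
    """Return an N x n 0/1 design matrix: 1 = examinee j is administered item i.

    Items are taken from easiest to hardest, and administration stops after
    the first wrong answer. Responses are simulated under a Rasch model,
    which is an assumption of this sketch.
    """
    order = np.argsort(b)                      # easiest item first
    design = np.zeros((len(theta), len(b)), dtype=int)
    for j, th in enumerate(theta):
        for i in order:
            design[j, i] = 1                   # item i is administered
            p_correct = 1.0 / (1.0 + np.exp(-(th - b[i])))
            if rng.random() >= p_correct:      # wrong answer: stop testing
                break
    return design

rng = np.random.default_rng(0)                 # hypothetical seed
theta = np.linspace(-2.0, 2.0, 16)             # 16 examinees ordered by ability
b = np.linspace(-3.0, 3.0, 20)                 # 20 items ordered by difficulty
D = guttman_design(theta, b, rng)
```

Every row of `D` is a run of ones followed by zeros, so the zeros are empty by design rather than missing responses.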

For the description of a test design some notation is needed. Suppose
that we wish to construct an optimal test design for a sample of $N$ examinees
($j = 1, \ldots, N$) and $n$ distinct items ($i = 1, \ldots, n$). Let the matrix
$U = [u_{ij}]$ represent the response pattern. If the $\theta$-scale with all
possible abilities is divided into $c$ distinct categories $\theta_j$, such that
$1 \le c \le N$, then these categories can be gathered in
$\theta' = (\theta_1, \theta_2, \theta_3, \ldots, \theta_c)$, where
$\theta \in E^c$ and $E^c$ is a $c$-dimensional set of real
numbers. Corresponding with the vector $\theta$ is a vector of weights,
$W = (w_1, w_2, w_3, \ldots, w_c)'$. These weights can be used in different ways.

The weights in $W$ may be used to characterize the distribution of the
sample for which the optimal test is constructed. If, for example, all weights in
$W$ are equal, then the sample will have a uniform distribution for the abilities.
By a suitable selection of weights a normal ability distribution can also be
approximated. The weights can also be used to select only a few $\theta$-levels.
If we wish to find an optimal test design for only two extreme $\theta$-levels,
then all but the corresponding two weights $w_j$ will be equal to zero. The weights
can further be used to give more weight to certain $\theta_j$-values than to
others, or to emphasize the sizes of the intervals between the different
$\theta_j$-levels. Some of the criteria discussed in this chapter make use of such
weights.
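
The different uses of the weights can be made concrete with a small sketch. The number of categories ($c = 7$), the ability grid, and the helper name `normal_weights` are assumptions chosen for illustration only.

```python
import numpy as np

def normal_weights(theta_levels, mean=0.0, sd=1.0):
    """Weights proportional to the normal density at each ability level."""
    z = (np.asarray(theta_levels, dtype=float) - mean) / sd
    w = np.exp(-0.5 * z ** 2)
    return w / w.sum()                 # normalise so that the weights sum to 1

theta_levels = np.linspace(-3.0, 3.0, 7)            # c = 7 ability categories
w_uniform = np.full(7, 1.0 / 7)                     # uniform ability distribution
w_normal = normal_weights(theta_levels)             # approximately normal
w_extremes = np.array([0.5, 0, 0, 0, 0, 0, 0.5])    # only the two extreme levels
```

Setting a weight to zero removes the corresponding ability level from any criterion that is a weighted function of the information values.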

The items in the test design can be characterized by the vector of
structural parameters $\xi' = (\xi_1, \xi_2, \xi_3, \ldots, \xi_n)$, where each
element $\xi_i$ is a vector that may represent more than one item parameter. For
example, for the Rasch model $\xi_i$ will represent the difficulty or location
parameter. For extensions of the Rasch model, $\xi_i$ may contain more than one
parameter. Of course, items with the same item parameter values may be represented
by the same vector $\xi_i$.

The probability of obtaining a correct response can now be given by the
function $P(\theta_j, \xi_i)$. The mean and variance of the parametric family are
$P(\theta_j, \xi_i)$ and $P(\theta_j, \xi_i)\{1 - P(\theta_j, \xi_i)\}$,
respectively, and the likelihood function for the data matrix $U$ and $\theta$ is:

$$ L[u; \theta, \xi] = \prod_{j=1}^{c} \prod_{i=1}^{n} P(\theta_j, \xi_i)^{u_{ij}} \{1 - P(\theta_j, \xi_i)\}^{1 - u_{ij}} , \tag{1} $$

where $u_{ij}$ is the proportion of correct responses on item $i$ in category $j$
of $\theta$, and estimation of the parameters $\{\theta_j, \xi_i\}$ can take place
by means of the usual maximum likelihood (ML) estimation procedures.
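
As a rough illustration, the logarithm of such a likelihood can be evaluated as follows. The sketch assumes the two-parameter logistic model with discrimination `a` and difficulty `b`, assumes the usual Bernoulli form of the likelihood, and codes cells that are empty by design as NaN. The function names are hypothetical.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response; a = 1 gives the Rasch model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def log_likelihood(u, theta, a, b):
    """Log-likelihood of a c x n matrix u of 0/1 responses or proportions correct.

    Cells that are empty by design are coded as NaN and contribute nothing.
    """
    p = p_correct(theta[:, None], a[None, :], b[None, :])
    terms = u * np.log(p) + (1.0 - u) * np.log(1.0 - p)
    return float(np.nansum(terms))

# One ability level, two items, the second item not administered.
theta = np.array([0.0])
a = np.array([1.0, 1.0])
b = np.array([0.0, 0.0])
u = np.array([[1.0, np.nan]])
value = log_likelihood(u, theta, a, b)   # log of P(correct) for the first item only
```

Because empty cells are skipped, the same function can be used for fixed-form, adaptive, and testlet designs.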

After a model $P(\theta_j, \xi_i)$ is chosen, the test design can be selected. The
selection of a test design must be done in such a way that it will lead to the
most accurate estimation of the parameters. The problem, however, is to find
such test designs. More specifically, the problem is to find the set of parameters
$\xi$ connected with a certain test form that will enable the most efficient
estimation of the parameters in the sample characterized by
$\{\theta, W\}$.

The problem of finding optimal test designs cannot be answered in any 
general sense and will depend on a number of factors. First, the assumed 
response model will determine the final outcome. An optimal test design for the 
Rasch model will generally not be optimal for the two-parameter logistic model.
Fortunately, however, the methods for finding optimal test designs can be applied 
to practically any parametric IRT model. 

A second problem is connected with the test form. An optimal test 
design will differ per test form. For example, an optimal design for a fixed-form 
test may not be optimal at all for an adaptive test, and vice versa. 

Another problem is connected with the parameters themselves. The 
accuracy of the parameter estimates will depend on the amount of information in 
the data, and test designs may differ in their amount of information. The variance 
of the estimators is usually inversely related to the amount of information in the
data, and some suitable information measure must be chosen before one can find 
an optimal test design. 

Finally, a selection criterion for the items must be chosen. Since the
optimality of a test design will depend on the optimality criterion that was used,
the choice of criterion may be crucial. Two kinds of criteria can be
distinguished. The first kind is based on all parameters in $\{\theta, W\}$.
This enables a simultaneous optimization procedure for all the parameters in
$\theta$. The second kind is formulated for a subset of parameters or even for
single parameters, and allows for a stepwise optimization, i.e., for each of the
$\theta_j$-values separately. Although the latter group of criteria has been
frequently applied in adaptive testing, these criteria can also be used for the
design of fixed-form tests.

In the following section the two most frequently applied information
measures will be described.

Information Measures 

Many different types of information measures for the estimation of 
parameters have been proposed. Two of the most frequently applied information 
measures are Fisher's information measure and the Bayesian measure, which is 
based on the inverse variance of the posterior distribution. 

Let the information measure be symbolized by $J(\theta_j)$. Then Fisher's
information function connected with the parameter $\theta_j$ is defined as:

$$ J(\theta_j) = E \left\{ \frac{\partial}{\partial \theta_j} \log L[u; \theta, \xi] \right\}^2 , \tag{2} $$

where $L[u; \theta, \xi]$ is the likelihood function. Higher values of
$J(\theta_j)$ indicate that more information on the parameter $\theta_j$ is
available in the sample. Fisher's information has been the most often used measure
in test construction. Not only do the mathematical programming methods for the
construction of fixed-form tests make use of this measure, it is also very popular
for the construction of adaptive tests.
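
For the two-parameter logistic model, Fisher's information of a single item has the well-known closed form $a_i^2 P (1 - P)$, and the information in a test is the sum over its items. A minimal sketch under that model follows; the function names are hypothetical.

```python
import numpy as np

def item_information(theta, a, b):
    """Fisher information of 2PL items at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

def design_information(theta_levels, a, b):
    """J(theta_j) for every ability level: the sum of the item informations."""
    theta_levels = np.asarray(theta_levels, dtype=float)
    info = item_information(theta_levels[:, None], a[None, :], b[None, :])
    return info.sum(axis=1)

# Two items located at theta = 0 with discriminations 1 and 2.
J = design_information([0.0], np.array([1.0, 2.0]), np.array([0.0, 0.0]))
```

At $\theta = b_i$ the probability is 0.5, so each item contributes $a_i^2 / 4$ and the example gives $0.25 + 1.0 = 1.25$.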
The second measure is the Bayesian measure. The Bayesian approach to
test construction was first proposed by Owen (1975). Instead of using Fisher's
information on the ability parameters, Owen (1975) proposed to use the posterior
variance. To our knowledge, no mathematical programming procedure based on
the maximization of the inverse posterior variance criterion has yet been
proposed. To do this, one must first formulate a suitable prior distribution on the
abilities being measured. Then, after the selection of response data, the posterior
distribution has to be developed by combining the prior distribution with the
response data. This means that the use of a Bayesian selection criterion to select
items for inclusion in a fixed-form test would not be very practical. On the other
hand, the implementation of such a Bayesian procedure in the mathematical
programming models for two-stage or multi-stage testing procedures proposed by
Adema (1990) would be feasible, and it would probably increase the efficiency of
the selection procedure, at least when a suitable prior is selected.

When the expected posterior variance is used for item selection, then:

$$ J(\theta_j) = E \left\{ \mathrm{Var}^{-1}(\theta_j \mid M(\theta_j)) \right\} , \tag{3} $$

where $M(\theta_j)$ is the prior information on $\theta_j$.
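
The basic ingredient of the Bayesian measure, the posterior variance of an ability given a prior and a set of responses, can be approximated on a discrete grid. The sketch below is not Owen's (1975) procedure. It assumes a standard normal prior, the two-parameter logistic model, and a hypothetical grid from -4 to 4; it computes the posterior variance for given responses rather than its expectation over future responses.

```python
import numpy as np

def posterior_variance(u, a, b, grid, prior):
    """Posterior variance of theta on a discrete grid, given 0/1 responses u."""
    p = 1.0 / (1.0 + np.exp(-a[None, :] * (grid[:, None] - b[None, :])))
    likelihood = np.prod(np.where(u[None, :] == 1, p, 1.0 - p), axis=1)
    post = prior * likelihood
    post = post / post.sum()                     # normalise the posterior
    mean = np.sum(grid * post)
    return float(np.sum((grid - mean) ** 2 * post))

grid = np.linspace(-4.0, 4.0, 161)
prior = np.exp(-0.5 * grid ** 2)                 # standard normal prior (assumed)
prior = prior / prior.sum()

empty = np.array([])
v_prior = posterior_variance(empty, empty, empty, grid, prior)
v_post = posterior_variance(np.array([1, 0, 1]),
                            np.array([1.5, 1.5, 1.5]),
                            np.array([-0.5, 0.5, 0.0]), grid, prior)
```

The inverse of `v_post` plays the role of the information value: the more informative the administered items, the smaller the posterior variance.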

For all the parameters in $\{\theta, W\}$, the information measures $J(\theta_j)$
can be grouped into the following vector:

$$ J(\theta \mid \xi)' = [\, J(\theta_1), J(\theta_2), J(\theta_3), \ldots, J(\theta_c) \,] . \tag{4} $$

This vector $J(\theta \mid \xi)$ contains all available information on the
parameters $\theta$ in the data, and optimality of a test design is usually
represented by a function of the two vectors $J(\theta \mid \xi)$ and $W$. It
should be noted that for multidimensional IRT models the vector $\theta$ will
become a matrix and $J(\theta \mid \xi)$ will also be a matrix, but
the optimality procedures will generally remain the same.

A Class of Optimal Design Criteria 

The information measures given above are related to the amount of
uncertainty of the estimators of the elements in $\theta$. Optimality of a test
design can be defined simultaneously for all the parameters in $\{\theta, W\}$ by
considering a function $\Phi(\cdot)$ of $J(\theta \mid \xi)$ and $W$. Such a
simultaneous optimization has the advantage that it will lead to an optimal design
for the whole sample of examinees characterized by $\{\theta, W\}$, and also takes
into account the shape of its ability distribution.

An optimal test design is a design for which the function
$\Phi\{J(\theta \mid \xi), W\}$ has the largest possible value, and the problem of
finding an optimal test design is actually the problem of maximizing a real-valued
concave objective function, i.e.:

$$ \text{maximize} \quad \Phi\{J(\theta \mid \xi), W\} \tag{5} $$

subject to

$$ N \le N_{\max} , \tag{6} $$

where $N_{\max}$ is some prior specified maximum sample size. In most cases this
maximization problem is not easy to solve, and the solution will generally depend
on the function $\Phi(\cdot)$ and the information measure being used.

Kiefer (1974) considered a general class of optimality criteria $\Phi(\cdot)$ and
discussed their properties within an approximate equivalence theory. Members of
this class are the so-called product criterion, the sum criterion, and the minimum
value criterion. Conceptually, the product criterion corresponds
to the well-known geometric mean and the sum criterion
to the arithmetic mean. This class not only includes these
simultaneous optimality criteria, but also includes criteria which are suitable for
stepwise optimization. Table 1 displays the different optimality criteria, and
each of these criteria will be discussed in the following sections.



Insert Table 1 about here



Optimality Criteria for Simultaneous Optimization

Product criterion 

The first criterion is a product criterion. The most frequently applied
form is the determinant criterion. Usually this criterion is defined as the
determinant of an inverse variance-covariance matrix of the estimators, and is
often referred to as the D-optimality criterion. This measure was first proposed
by Wald (1943), and it is also known as the generalized variance criterion
(Anderson, 1984). It can be shown that this criterion is related to Shannon's
(1948) information measure of uncertainty about the parameters (Berger, 1991).
If the vector $J(\theta \mid \xi)$ represents the main diagonal of a diagonal
matrix, then the determinant of that matrix is the product of the main diagonal
elements. For an optimal test design the product criterion will become:

$$ \Phi\{J(\theta \mid \xi), W\} = \prod_{j=1}^{c} J(\theta_j)^{w_j} . \tag{7} $$

This criterion has many advantages. Perhaps one of the main reasons for
using this criterion is that it has a natural interpretation. It can be shown that
it is related to the volume of a confidence region in the parameter space. This
means that it can be used to formulate a confidence interval around the parameter
estimates. A second feature is that it does not depend on the scale of the
independent variable. For the well-known one-, two-, and three-parameter IRT
models, this means that the D-optimality criterion is invariant under linear
transformation of the logit scale. Finally, it must be mentioned that its upper
bounds for the two-parameter logistic model have been derived by Khan & Yazdi
(1988). This means that the actual optimality function value can be compared to
the maximal achievable value of the criterion. Such a comparison, for example,
was done by Berger (1992b) for two-stage sampling designs.

The D-optimality criterion has also been appealing because of its
equivalence with other criteria. The General Equivalence Theorem of Kiefer and
Wolfowitz (1960) shows that the D-optimality criterion is equivalent to the
G-optimality criterion, which minimizes the maximum variance of the predicted
response over the design space. This result indicates that a design is D-optimal if
and only if it is G-optimal.

The D-optimality criterion also has some disadvantages. The first
disadvantage is that it is generally not sensitive to misspecifications of the
model. For example, Abdelbasit & Plackett (1983) showed that for the two-parameter
logistic model a D-optimal sampling design for the estimation of the two
parameters of a single item consists of only two distinct ability levels. Berger
(1992a,b) presents figures of these sampling designs. Because such D-optimal
designs are based on only two distinct design points or ability levels, they may
not be sensitive to changes in the model specification. Not only minor, but also
large deviations in the item characteristic curve may go undetected with data
collected according to these designs. Although these problems have been
encountered for sampling designs, it can be inferred that these problems will also
occur when the D-optimality criterion is applied to test designs.

Another disadvantage of this criterion is that models with a different 
number of parameters cannot be compared with each other, because the function
depends on the number of parameters being used. It should be noted, however,
that this problem also occurs with the other functions. 

Sum criterion 

A second criterion is the trace or A-optimality criterion. For the test
design this function is defined as a weighted sum of information measures
connected with the $c$ $\theta_j$-parameters in the sample:

$$ \Phi\{J(\theta \mid \xi), W\} = \sum_{j=1}^{c} w_j J(\theta_j) . \tag{8} $$



This criterion has also often been applied in optimal design research. 
Although there are cases in which A-optimality is more easily demonstrated than 
D-optimality, the A-optimality criterion does not have the same advantages as the 
D-optimality criterion. It is not invariant under linear transformation of the 
parameter scale and its upper bounds depend on the actual values of the 
parameters themselves. Although this criterion may seem more appealing to 
practitioners than the D-optimality criterion, it has hardly been applied in IRT 
modelling. An example of such a sum criterion for mathematical programming 
methods has been given by van der Linden and Boekkooi-Timminga (1989). 

Minimum value criterion 

This criterion may have different forms. Either the minimum value of
the information on the parameters is maximized, or the maximum value of the
inverse information or asymptotic variance is minimized. An alternative
formulation is based on the smallest eigenvalue of the information matrix, and is
referred to as the E-optimality criterion. For IRT models and test designs the
smallest value of the vector $J(\theta \mid \xi)$ is maximized:

$$ \Phi\{J(\theta \mid \xi), W\} = \min_{j = 1, \ldots, c} \{ J(\theta_j) \} . \tag{9} $$

This criterion is often called a MAXIMIN criterion. An example of a MAXIMIN 
criterion used as objective function in mathematical programming is given by van 
der Linden & Boekkooi-Timminga (1989). 
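
The three simultaneous criteria can be written as short functions of an information vector and a weight vector. This is a sketch under stated assumptions: the function names are hypothetical, and the product criterion is taken in its weighted geometric mean form, with the weights entering as exponents.

```python
import numpy as np

def product_criterion(J, w):
    """Weighted product of the information values (weights as exponents)."""
    J, w = np.asarray(J, dtype=float), np.asarray(w, dtype=float)
    return float(np.prod(J ** w))

def sum_criterion(J, w):
    """Weighted sum of the information values."""
    J, w = np.asarray(J, dtype=float), np.asarray(w, dtype=float)
    return float(np.sum(w * J))

def minimum_criterion(J):
    """Smallest information value over all ability levels (MAXIMIN target)."""
    return float(np.min(np.asarray(J, dtype=float)))

J = [1.0, 4.0]            # information at two ability levels
w = [0.5, 0.5]            # equal weights
values = (product_criterion(J, w), sum_criterion(J, w), minimum_criterion(J))
```

With equal weights the product criterion is the geometric mean (2.0 here) and the sum criterion the arithmetic mean (2.5 here), while the minimum value criterion only looks at the worst-measured level (1.0 here).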

Optimality Criteria for Stepwise Optimization

The function $\Phi(\cdot)$ is defined for the whole set of parameters
$\{\theta, W\}$. In some cases, however, optimality for some subset of parameters
or for each single parameter may be of interest. For example, a test constructor
may want to find a test design that is optimal for the estimation of only the lower
ability levels in a sample. Such a selection of the parameters in $\theta$ can be
established by setting the weights corresponding to the higher $\theta_j$-values
equal to zero. The problem is then to find an optimal test design for the subset
$\{\theta_s, W_s\}$, where $1 \le s \le c$ is the
number of parameters in the subset. In many cases, the estimators of the
parameters in the subset will not be independent of the estimators of the
remaining parameters. In these cases, this dependency should be taken into
account when items are selected. The solution to the maximization problem for a
reduced set of parameters is often referred to in the optimal design literature as
$\Phi_s$-optimality.

In this section, criteria which are formulated for a single parameter $\theta_j$,
i.e., for a single examinee, will be given. These criteria are special cases of the
criteria given above for a whole sample of examinees. Instead of a simultaneous
maximization, these methods allow for a stepwise optimization for each single
parameter separately. These criteria have mainly been used for the construction of
adaptive tests.

In adaptive testing, the construction of a test is individualized for each
examinee, and the item selection criterion is formulated for each examinee
separately, that is, for a single parameter. A distinction between construction
methods for fixed-form tests and adaptive tests is that item selection in adaptive
testing is based on previous responses. If an examinee $x$ has an ability
$\theta_x$ ($x \in \{1, \ldots, N\}$), then the selection criterion is based on an
estimate $\hat{\theta}_x$ of the parameter $\theta_x$
instead of on the parameter itself. Given such a provisional point estimate, the
items with the largest information at the ability estimate are selected, i.e.:

$$ \Phi\{J(\theta \mid \xi), W\} = J(\hat{\theta}_x) . \tag{10} $$

This criterion was first suggested by Lord (1977) and has been referred to as the
maximum information selection criterion (Thissen & Mislevy, 1990). An adaptive
test is composed sequentially, after successive administration of the selected
items. Compared to the fixed-form test, the adaptive test form may lead to more
efficient estimates of the ability, but the stepwise search, for each examinee
separately, through a relatively large number of items will, of course, be more
time consuming than the construction of a fixed-form test.
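
One step of maximum information selection might be sketched as follows, assuming the two-parameter logistic model. The function name and the small three-item bank are hypothetical.

```python
import numpy as np

def select_max_information(theta_hat, a, b, administered):
    """Index of the unused item with maximal information at theta_hat."""
    p = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))
    info = a ** 2 * p * (1.0 - p)          # 2PL item information at theta_hat
    for i in administered:
        info[i] = -np.inf                  # items already taken are excluded
    return int(np.argmax(info))

a = np.array([1.0, 1.0, 1.0])              # hypothetical item bank
b = np.array([-1.0, 0.0, 1.0])
first = select_max_information(0.1, a, b, administered=set())
second = select_max_information(0.1, a, b, administered={first})
```

With equal discriminations the selected item is simply the unused item whose difficulty lies closest to the provisional estimate. In a complete adaptive test the estimate would be updated after every response before the next call.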

There is, however, a disadvantage. Criterion (10) is based on a 
provisional point estimate of $\theta_x$. Especially when the information measure is
based on relatively few items, the uncertainty of the estimator may be very high. 
In these cases the selection of items may be improved by applying a criterion 
which will take into account the uncertainty of the estimators. Some objective 
functions that do take into account the uncertainty of the estimators have been 
proposed by Veerkamp & Berger (1994). 

A $100(1-\alpha)\%$ confidence interval for $\theta_x$ with lower limit
$\theta_L$ and upper limit $\theta_R$ can be formulated by means of the well-known
property that its estimator is asymptotically normally distributed with mean
$\theta_x$ and variance $J(\theta_x \mid \xi)^{-1}$, in which
$J(\theta_x \mid \xi)$ may be replaced by (10). If the pair of vectors
$(\theta_s, W_s)$ contains all discrete values of the abilities lying within the
confidence interval for $\theta_x$, so that the first (lowest) $\theta_j$-value is
$\theta_L$ and the highest (last) $\theta_j$-value is $\theta_R$,
then the area under the information function with limits $\theta_L$ and
$\theta_R$ may be roughly approximated by:

$$ \Phi\{J(\theta \mid \xi), W\} = \sum_{j=L}^{R} \omega_j J(\theta_j) , \tag{11} $$

where $\omega_j = |\theta_{j-1} - \theta_j|$. These weights are used to include
the size of the intervals between the distinct $\theta_j$-levels, and as such
enable approximation of the area under the information function. Item selection in
adaptive testing may be improved by applying this interval criterion instead of the
maximum (point) information criterion.
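
The interval criterion can be sketched for a single candidate item as a simple rectangle-rule sum over equally spaced ability levels inside the confidence interval. The two-parameter logistic model, the default of 21 grid points, the critical value 1.96, and the function name are assumptions of this sketch.

```python
import numpy as np

def interval_information(a_i, b_i, theta_hat, info_so_far, z=1.96, n_points=21):
    """Approximate area under one item's information function over the
    confidence interval theta_hat +/- z / sqrt(info_so_far)."""
    half_width = z / np.sqrt(info_so_far)
    levels = np.linspace(theta_hat - half_width, theta_hat + half_width, n_points)
    p = 1.0 / (1.0 + np.exp(-a_i * (levels - b_i)))
    info = a_i ** 2 * p * (1.0 - p)
    omega = np.diff(levels)                  # interval sizes between the levels
    return float(np.sum(omega * info[1:]))   # rectangle rule

near = interval_information(1.0, 0.0, theta_hat=0.0, info_so_far=1.0)
far = interval_information(1.0, 3.0, theta_hat=0.0, info_so_far=1.0)
```

For the logistic model the exact area equals $a_i \{P(\theta_R) - P(\theta_L)\}$, which gives a convenient check on the approximation. As more items are administered, `info_so_far` grows, the interval shrinks, and the criterion approaches the point information criterion.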

An extension of the interval selection criterion is also possible by
including additional weights. If, for example, more weight is given to the
information measure $J(\theta \mid \xi)$ when the likelihood is high and less
weight is given when the likelihood is low, then a likelihood weighted selection
criterion may be formulated as:

$$ \Phi\{J(\theta \mid \xi), W\} = \sum_{j=1}^{c} L[u^{(n)} \mid \theta_j, \xi^{(n)}] \, w_j J(\theta_j) , \tag{12} $$

where $L[u^{(n)} \mid \theta_j, \xi^{(n)}]$ is the likelihood for the responses to
the $n$ already administered items. It should be noted that equations (11) and
(12) are equivalent to the weighted sum criterion given in equation (8); only the
weights differ. Some advantages of these criteria are given by Veerkamp & Berger
(1994). Because they additionally use the amount of uncertainty of the estimator,
these criteria are expected to perform at least as well as the maximum
information criterion. Simulation results given by Veerkamp and Berger (1994)
seem to support this conjecture.
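
A likelihood weighted criterion for one candidate item might be sketched as below. The two-parameter logistic model, the equally spaced ability levels, the uniform weights, and the function name are assumptions made for illustration; this is not the implementation of Veerkamp & Berger (1994).

```python
import numpy as np

def likelihood_weighted_information(a_i, b_i, u, a_adm, b_adm, levels, w):
    """Sum over ability levels of likelihood x weight x item information."""
    p_adm = 1.0 / (1.0 + np.exp(-a_adm[None, :] * (levels[:, None] - b_adm[None, :])))
    # Likelihood of the responses u to the items administered so far.
    like = np.prod(np.where(u[None, :] == 1, p_adm, 1.0 - p_adm), axis=1)
    p = 1.0 / (1.0 + np.exp(-a_i * (levels - b_i)))
    info = a_i ** 2 * p * (1.0 - p)          # information of the candidate item
    return float(np.sum(like * w * info))

levels = np.linspace(-3.0, 3.0, 61)
w = np.full(61, 1.0 / 61)
u = np.array([1, 1])                         # two correct responses so far
a_adm = np.array([1.0, 1.0])
b_adm = np.array([0.0, 0.0])
hard = likelihood_weighted_information(1.0, 1.0, u, a_adm, b_adm, levels, w)
easy = likelihood_weighted_information(1.0, -1.0, u, a_adm, b_adm, levels, w)
```

After two correct responses the likelihood is higher at the upper ability levels, so the harder candidate item receives the larger criterion value.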

An Illustration 

One of the main features of simultaneous optimization criteria is that the
shape of the ability distribution can be taken into account. An illustration of this
feature is presented in Figures 2 and 3. Suppose that we have an item bank with
an infinite number of items. These items have been calibrated by means of the
two-parameter IRT model, and cover the full range of combinations of
$b_i \in \langle -3, +3 \rangle$ and $a_i \in \langle 0.5, 3.0 \rangle$. The
product criterion in (7) was used to select the items from the item bank.

Figure 2 gives the probability mass functions of the resulting optimal test
designs for a positively skewed ability distribution, for items having three
different values of the discrimination parameter, $a_i = 1.0$, $2.0$, and $3.0$,
respectively. These functions indicate that if the items have a discrimination
parameter $a_i = 1.0$, the optimal test would consist of about 80% of items
having difficulty parameter value $b_i = -1$ and about 20% of items with
difficulty $b_i = -0.5$. A very small proportion of items would have a difficulty
parameter value $b_i = 2.0$. When the items have a higher value of $a_i$, the
shape of the probability mass function on the difficulty scale will resemble the
positively skewed ability distribution for which the test was designed.



Insert Figure 2 about here 



In Figure 3 the optimal test designs are given for a uniform ability
distribution. The results in Figure 3 show that for a uniform ability distribution
the probability mass functions will also approximately have a uniform shape on
the difficulty scale. It should be noted that the selection of items from the item
bank is rather artificially structured, i.e., the items are assumed to have a
constant discrimination parameter in each test, and exhaustion of the item bank
does not play a role because of the infinite number of items. In this case, the
optimal combination of parameter values can be selected as often as required. For
small item banks, the results are expected to be quite different.



Insert Figure 3 about here 
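
The kind of selection used in the illustration can be imitated with a simple greedy, sequential sketch: at each step the difficulty value that most increases the weighted log-product of the information values is added to the test. The greedy rule, the candidate difficulty grid, the constant discrimination, and the function name are assumptions of this sketch. The source does not state which algorithm produced Figures 2 and 3, and the sketch is not meant to reproduce them.

```python
import numpy as np

def greedy_product_design(theta_levels, w, a, candidate_b, n_items):
    """Sequentially add the difficulty that most increases the weighted
    log-product criterion sum_j w_j * log J(theta_j)."""
    J = np.zeros(len(theta_levels))
    chosen = []
    for _ in range(n_items):
        best_b, best_val, best_J = None, -np.inf, None
        for b in candidate_b:                # the bank never runs out of items
            p = 1.0 / (1.0 + np.exp(-a * (theta_levels - b)))
            J_new = J + a ** 2 * p * (1.0 - p)
            val = np.sum(w * np.log(J_new))
            if val > best_val:
                best_b, best_val, best_J = b, val, J_new
        chosen.append(float(best_b))
        J = best_J
    return chosen

theta_levels = np.array([-1.0, 0.0, 1.0])
candidate_b = np.linspace(-3.0, 3.0, 13)     # difficulties in steps of 0.5
uniform = greedy_product_design(theta_levels, np.full(3, 1.0 / 3), 1.0, candidate_b, 4)
one_level = greedy_product_design(theta_levels, np.array([0.0, 0.0, 1.0]), 1.0, candidate_b, 1)
```

Counting how often each difficulty value occurs in the returned list gives a probability mass function on the difficulty scale, which is the form in which the optimal designs are displayed in the figures.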



Discussion and Conclusion 

An important aspect of objective measurement in education is the 
construction of test forms which are not only valid and reliable, but also will 
produce efficient estimates for the latent trait distribution for which the particular 
test is designed. In this chapter the designing of different test forms is placed 
within the general theory of optimal designs. Different methods for optimal 
design are reviewed in this paper and their properties are discussed. The main 
conclusion of this paper is that the different test forms, such as fixed-form tests, 
adaptive tests and testlets, can be constructed by means of comparable methods, 
and that these methods are actually the same as the procedures which have been 
used in optimal design theory. 

The construction of an optimal test design can be viewed as an 
optimization problem, and several algorithms for finding optimal designs have 
been proposed in the literature. Among those optimization procedures are the 
mathematical programming procedures, which have been used for the 
construction of fixed-form tests by Adema (1990) and Boekkooi-Timminga (1989), among
others. Apart from these procedures several other optimal design algorithms have
been applied in other fields of research. See, for example, Cook and Nachtsheim
(1980) for a review. Perhaps the most promising algorithms for test construction
are the sequential design algorithms. These procedures have been studied
extensively by Ford, Titterington and Wu (1985), Wu (1985), Wu and Wynn
(1978) and Wynn (1970), and were applied to IRT modelling by Berger (1992a,b,
in press). The sequential construction of optimal test designs by means of the
methods discussed in this paper is straightforward.

Table 1

Different Item Selection Criteria for Simultaneous and Stepwise Optimization

Simultaneous Optimization    Stepwise Optimization
-------------------------    ------------------------------
Product Criterion            Maximum Information Criterion
Sum Criterion                Interval Information Criterion
Min. Value Criterion         Weighted Interval Criterion



Subject Index 

adaptive testing 
testlets 

fixed-form test 

information measure function 

mathematical programming 

(optimal) test design / test construction 

item bank 

item selection 

optimal sampling 

parameter estimation 

Guttman scale 

(maximum) likelihood 

Rasch model/(two-parameter) logistic model 

IRT 

Bayesian measure 
posterior variance / distribution 
prior distribution 
product criterion 
(weighted) sum criterion 
minimum value criterion 
stepwise optimization 
simultaneous optimization
determinant criterion 
D-optimality criterion 
generalized variance criterion 
Shannon's information measure 
general Equivalence Theorem 
G-optimality criterion 
trace criterion 
A-optimality criterion
E-optimality criterion
maximin criterion
$\Phi_s$-optimality

maximum information selection criterion
interval selection criterion
weighted interval selection criterion 
discrimination parameter 
difficulty parameter 
ability distribution

Figure Captions

Figure 1 Three test designs. 

Figure 2 Simultaneously designed optimal test design for a positively skewed 
ability distribution. 

Figure 3 Simultaneously designed optimal test design for a uniform ability 
distribution.